15. Evaluation Metrics

Confusion Matrix: False Alarms

In [1]:
# a small helper to strip the URL prefix so only the video id is passed to YouTubeVideo
def strip_url(url):
    return url.replace('https://youtu.be/','')
In [2]:
from IPython.display import YouTubeVideo
YouTubeVideo(strip_url('https://youtu.be/611qWzIxGmU'))
Out[2]:

The answer is 1. Also note the asymmetry: there are two kinds of wrong prediction. Call one PN (a burglary is predicted when there is none) and the other NP (no alarm is raised while a burglary is actually happening).

Obviously we want to reduce the number of NP-type mistakes: a burglary happens while we are unaware, which is much costlier than a PN-type false alarm (the alarm merely goes off when no burglary has actually happened).

Thus the nature of the problem may push us toward skewing our prediction (here, for example, the decision boundary) so that NP-type mistakes are minimized, even if that means more PN-type mistakes. A small sketch of this idea follows.
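To make the trade-off concrete, here is a minimal sketch with made-up data (not the burglary example's actual numbers): instead of the default 0.5 threshold on the predicted probability, we raise the alarm at a much lower one, so NP-type misses can only go down while PN-type false alarms may go up.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 1.0).astype(int)  # 1 = burglary

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]           # P(burglary) for each sample

default_alarm = (proba >= 0.5).astype(int)   # standard decision boundary
cautious_alarm = (proba >= 0.2).astype(int)  # skewed boundary: alarm more readily

def miss_and_false_alarm(y_true, y_pred):
    misses = np.sum((y_true == 1) & (y_pred == 0))        # NP: burglary, no alarm
    false_alarms = np.sum((y_true == 0) & (y_pred == 1))  # PN: alarm, no burglary
    return misses, false_alarms

print(miss_and_false_alarm(y, default_alarm))   # with the default threshold
print(miss_and_false_alarm(y, cautious_alarm))  # misses shrink, false alarms may grow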

**Classifying Chavez Correctly 1**

In [3]:
url = 'https://youtu.be/0PFq8zoaNWU'
YouTubeVideo(strip_url(url))
Out[3]:

What is the probability that Hugo Chavez (HC) would be classified correctly by our learning algorithm?

This is a slightly tricky question. Remember that the rows are the true labels and the columns, in the same order, are the predictions, so we need to read the matrix carefully.

(image: HC.png)

The HC row gives us the total number of actual HC images. For example, cell (row, col) = (6, 2) has the value 3, which means Hugo Chavez was predicted as Colin Powell 3 times. Similarly for the other cells in that row. In total, HC appears 16 times (predicted as others or as himself).

The diagonal cell of column 6 tells us how many times HC was predicted as HC, i.e. correctly: 10 times.

The probability of something is the fraction of times that something happens out of all related outcomes.

The probability of HC being predicted correctly is therefore the fraction of correct HC predictions out of all actual HC images (whether predicted correctly or wrongly).

$$ \displaystyle p(\text{HC predicted correctly}) = \frac{n(\text{HC predicted correctly})}{n(\text{actual HC})} = \frac{10}{16} = 0.625$$

As we will see shortly, this is called Recall.


$$ \displaystyle \text{Recall} = \frac{(\text{prediction matches reality})}{(\text{prediction matches reality}) + (\text{subject missed by prediction})} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

Here 'subject missed by prediction' counts the times the algorithm predicted someone else while it was HC in reality. That is a False Negative.
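As a hedged sketch, recall can be read straight off a confusion-matrix row when rows are reality and columns are predictions. The 3x3 matrix below is illustrative, not the actual LFW output: HC is collapsed to index 0, with entries arranged to reproduce the 10-out-of-16 count from the text.

import numpy as np

cm = np.array([[10,  3,  3],    # row 0: actual HC, predicted as HC / CP / other
               [ 0, 55,  8],
               [ 0, 12, 90]])

k = 0                                          # index of the subject (HC)
recall_k = cm[k, k] / float(cm[k, :].sum())    # correct HC / all actual HC
print(recall_k)                                # 10 / 16 = 0.625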

**Classifying Chavez Correctly 2**

In [4]:
url = 'https://youtu.be/HWW9BNHnPo0'
YouTubeVideo(strip_url(url))
Out[4]:

Given that the predictor says it is HC, what is the probability that it truly is HC?

This is another tricky question, but there is no need to bring in Bayesian conditional probability yet; the confusion matrix itself provides the answer.

(image: HC2.png)

Column 6 of the confusion matrix covers every time our algorithm predicted Hugo Chavez. For example, cell (row, col) = (2, 6) has the value 0, which means the algorithm predicted HC 0 times when the image was actually Colin Powell.

Reading down the column this way, our algorithm never predicts anyone else as HC, only HC himself. Thus the probability is 1.

$$ \displaystyle p(\text{true HC} \mid \text{predicted as HC}) = \frac{n(\text{predicted as HC and truly HC})}{n(\text{predicted as HC})} = \frac{10}{10} = 1$$

As we will see shortly, this is called Precision.


$$ \displaystyle \text{Precision} = \frac{(\text{prediction matches reality})}{(\text{prediction matches reality}) + (\text{subject wrongly flagged by prediction})} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

Here 'subject wrongly flagged by prediction' counts the times the algorithm predicted HC while it was someone else in reality. That is a False Positive.
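Precision, by contrast, is read off the column. Reusing the same illustrative matrix (again an assumption, not the actual LFW output):

import numpy as np

cm = np.array([[10,  3,  3],
               [ 0, 55,  8],
               [ 0, 12, 90]])
k = 0                                             # subject index (HC)
precision_k = cm[k, k] / float(cm[:, k].sum())    # predicted-as-HC that are truly HC
print(precision_k)                                # 10 / 10 = 1.0

Note that sklearn's precision_score and recall_score with average=None return these per-class values directly from the label lists, without building the matrix by hand.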

Powell Precision and Recall

In [5]:
url = 'https://youtu.be/QWWq77k-K_0'
YouTubeVideo(strip_url(url))
Out[5]:

Calculate Recall and Precision for "Colin Powell".

(image: 2018-06-25_13h18_38.png)

Answer:
$$ \displaystyle \begin{aligned}\text{Recall} &= \frac{55}{55+8} = 0.873\\ \text{Precision} &= \frac{55}{55+(4+1+3+1+3)} = 0.821\end{aligned}$$

Bush Precision and Recall

In [6]:
url = 'https://youtu.be/8fM13xqU2a8'
YouTubeVideo(strip_url(url))
Out[6]:

Calculate Recall and Precision for "Bush".

(image: Bush.png)

Answer:
$$ \displaystyle \begin{aligned}\text{Recall} &= \frac{123}{123+(3+1)} = 0.968\\ \text{Precision} &= \frac{123}{123+(1+8+8+7+2+7)} = 0.788\end{aligned}$$

True Positives in Eigenfaces

In [7]:
url = 'https://youtu.be/bgT8sWuV2lc'
YouTubeVideo(strip_url(url))
Out[7]:


Remember: True Positives, False Positives, and False Negatives are always relative to the subject in focus.

Here the subject is Tony Blair

So,

True Positive => Algorithm predicts Tony Blair and Reality is Tony Blair
False Positive => Algorithm predicts Tony Blair but Reality is others
False Negative => Algorithm predicts others but Reality is Tony Blair

For True Positives, then, the answer is 26. (image: truepositives.png)

False Positives in Eigenfaces

Along the same line of thought: (image: image.png)

Thus the answer is 8.

False Negatives in Eigenfaces

Along the same line of thought: (image: image.png)

Thus the answer is 8.

Practicing TP, FP, FN with Rumsfeld

Along the same line of thought: (image: Rum.png)

TP = 25
FP = 1+1 = 2
FN = 1+8+2 = 11

Remember the generalized notion:

Given a Subject,

True Positive => Algorithm predicts Subject and Reality is Subject
False Positive => Algorithm predicts Subject but Reality is others
False Negative => Algorithm predicts others but Reality is Subject

Positive => the Subject is predicted; Negative => otherwise.
True => Reality is in line with the Prediction; False => otherwise.
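The generalized notion translates into a few lines of code. This is only a sketch: the confusion matrix below is made up, with its first row and column arranged to reproduce the Rumsfeld counts from the quiz above (TP = 25, FP = 2, FN = 11).

import numpy as np

def tp_fp_fn(cm, k):
    # rows = reality, columns = prediction, k = index of the Subject
    tp = cm[k, k]
    fp = cm[:, k].sum() - cm[k, k]   # predicted as Subject, reality is others
    fn = cm[k, :].sum() - cm[k, k]   # predicted as others, reality is Subject
    return tp, fp, fn

cm = np.array([[25,  1,  8,  2],
               [ 1, 55,  4,  3],
               [ 0,  8, 123, 7],
               [ 1,  2,  5, 26]])
print(tp_fp_fn(cm, 0))   # (25, 2, 11) for the subject in row/column 0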

Applying Metrics to Your POI Identifier

Udacity:
Go back to your code from the last lesson, where you built a simple first iteration of a POI identifier using a decision tree and one feature.
Copy the POI identifier that you built into the skeleton code in evaluation/evaluate_poi_identifier.py
Recall that at the end of that project, your identifier had an accuracy (on the test set) of 0.724. Not too bad, right? Let's dig into your predictions a little more carefully.

In [8]:
# evaluate_poi_identifier.py
#!/usr/bin/python

"""
    Starter code for the evaluation mini-project.
    Start by copying your trained/tested POI identifier from
    that which you built in the validation mini-project.

    This is the second step toward building your POI identifier!

    Start by loading/formatting the data...
"""

import pickle
import sys
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit

data_dict = pickle.load(open("../17. Final Project/final_project_dataset.pkl", "r") )

### add more features to features_list!
features_list = ["poi", "salary"]

data = featureFormat(data_dict, features_list)
labels, features = targetFeatureSplit(data)

### your code goes here 
In [9]:
# code from last section 'validation'
from sklearn import tree
from sklearn.metrics import accuracy_score

def classify(features_train, labels_train):

    ### your code goes here--should return a trained decision tree classifier
    X = features_train
    Y = labels_train
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(X,Y)
    return clf

from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, test_size=0.3, random_state=42)

# train on the training split
clf = classify(features_train, labels_train)

# ask the classifier to predict on the held-out test split
labels_pred = clf.predict(features_test)

acc = accuracy_score(labels_test, labels_pred)
acc
C:\Users\parthi2929\Anaconda3\envs\py2\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Out[9]:
0.7241379310344828
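The deprecation warning above points at sklearn.model_selection; a minimal sketch of the same split and fit using the newer module (assuming features and labels from the earlier cell, and sklearn 0.18 or later) would look like this:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
labels_pred = clf.predict(features_test)
print(accuracy_score(labels_test, labels_pred))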

Number of POIs in Test Set

Udacity: How many POIs are predicted for the test set for your POI identifier? (Note that we said test set! We are not looking for the number of POIs in the whole dataset.)

In [10]:
(labels_pred == 1).sum()  # no of times 1 occurs in the ndarray
Out[10]:
4

Number of People in Test Set

In [11]:
len(labels_pred)
Out[11]:
29

Accuracy of a Biased Identifier

Udacity: If your identifier predicted 0. (not POI) for everyone in the test set, what would its accuracy be?

In [12]:
labels_test.count(1)  # actual number of POIs in the test data
labels_dummy = [0 for _ in xrange(len(labels_test))]  # predict 0 (not POI) for everyone
acc = accuracy_score(labels_test, labels_dummy)  # 25 of the 29 test labels are 0, so 25/29
acc
Out[12]:
0.8620689655172413

Number of True Positives

Udacity: Look at the predictions of your model and compare them to the true test labels. Do you get any true positives? (In this case, we define a true positive as a case where both the actual label and the predicted label are 1)

In [13]:
count = 0
for i,j in zip(labels_test, labels_pred):
    if i==1 and j==1:  # both actual and predicted are true => True Positive
        count += 1
count
Out[13]:
0

Unpacking Into Precision and Recall

Udacity: As you may now see, having imbalanced classes like we have in the Enron dataset (many more non-POIs than POIs) introduces some special challenges, namely that you can just guess the more common class label for every point, not a very insightful strategy, and still get pretty good accuracy!

Precision and recall can help illuminate your performance better. Use the precision_score and recall_score available in sklearn.metrics to compute those quantities.

What's the precision?

In [14]:
from sklearn.metrics import precision_score
precision_score(labels_test,labels_pred)
Out[14]:
0.0

Note: this is because the number of true positives is currently 0, and true positives form the numerator of precision.

Recall of Your POI Identifier

Udacity: What's the recall?

(Note: you may see a message like UserWarning: The precision and recall are equal to zero for some labels. Just like the message says, there can be problems in computing other metrics (like the F1 score) when precision and/or recall are zero, and it wants to warn you when that happens.)

Obviously this isn't a very optimized machine learning strategy (we haven't tried any algorithms besides the decision tree, or tuned any parameters, or done any feature selection), and now seeing the precision and recall should make that much more apparent than the accuracy did.

In [15]:
from sklearn.metrics import recall_score
recall_score(labels_test,labels_pred)
Out[15]:
0.0

Again, this is because the number of true positives is 0, and true positives form the numerator of recall.
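To see at a glance why both metrics collapse to zero, one option (assuming labels_test and labels_pred from the cells above) is to print the full confusion matrix:

from sklearn.metrics import confusion_matrix
print(confusion_matrix(labels_test, labels_pred))
# layout: [[TN, FP],
#          [FN, TP]]  -- with TP = 0, both numerators are 0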

Made-up Predictions

How Many True Positives?

Udacity: Here are some made-up predictions and true labels for a hypothetical test set; fill in the following boxes to practice identifying true positives, false positives, true negatives, and false negatives. Let's use the convention that '1' signifies a positive result, and '0' a negative.

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

How many true positives are there?

In [16]:
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
count = 0
for i,j in zip(true_labels, predictions):
    if i==1 and j==1:  # both actual and predicted are true => True Positive
        count += 1
count
Out[16]:
6

How Many True Negatives?

Remember, Negative => Others predicted, not Subject
True => Reality in line with Prediction

So, True Negatives => Algorithm predicts others, and reality is also others.

In [17]:
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
count = 0
for i,j in zip(true_labels, predictions):
    if i==0 and j==0:  # both actual and predicted are false
        count += 1
count
Out[17]:
9

False Positives?

Positive => Subject predicted
False => Prediction not in line with Reality

False Positive => Algorithm predicts Subject (1) but Reality is others (0)

In [18]:
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
count = 0
for i,j in zip(true_labels, predictions):
    if i==0 and j==1:  
        count += 1
count
Out[18]:
3

False Negatives??

Negative => Algorithm predicts others
False => Prediction not in line with Reality

False Negative => Algorithm predicts others (0) but Reality is Subject (1)

In [19]:
predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
count = 0
for i,j in zip(true_labels, predictions):
    if i==1 and j==0:  
        count += 1
count
Out[19]:
2
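As a sanity check on the four hand counts, sklearn's confusion_matrix can produce all of them at once from the same made-up lists:

from sklearn.metrics import confusion_matrix

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(true_labels, predictions).ravel()
print(tn, fp, fn, tp)   # 9, 3, 2, 6 -- matching the loops above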

Precision


$$ \displaystyle \text{Precision} = \frac{(\text{Prediction: Subject, Reality: Subject})}{(\text{Prediction: Subject, Reality: Subject}) + (\text{Prediction: Subject, Reality: Others})} = \frac{\text{True Positive}}{\text{True Positive + False Positive}}$$

In [20]:
from __future__ import division  # Python 2: make / perform true division

predictions = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1] 
true_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]

TP_count, FP_count = 0, 0
for reality, prediction in zip(true_labels, predictions):

    # TP: both prediction and reality are the Subject
    if reality == 1 and prediction == 1:
        TP_count += 1

    # FP: prediction is the Subject, reality is others
    if reality == 0 and prediction == 1:
        FP_count += 1

Precision = TP_count/(TP_count + FP_count)
Precision
Out[20]:
0.6666666666666666

Recall


$$ \displaystyle \text{Recall} = \frac{(\text{Prediction: Subject, Reality: Subject})}{(\text{Prediction: Subject, Reality: Subject}) + (\text{Prediction: Others, Reality: Subject})} = \frac{\text{True Positive}}{\text{True Positive + False Negative}}$$

In [21]:
TP_count, FN_count = 0,0
for reality, prediction in zip(true_labels, predictions):

    # TP is both Pred and Reality, Subject
    if reality == 1 and prediction == 1:
        TP_count += 1

    # FN: prediction is others, reality is the Subject
    if reality == 1 and prediction == 0:
        FN_count += 1

Recall = TP_count/(TP_count+FN_count)
Recall
Out[21]:
0.75
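The same two numbers can be cross-checked with sklearn's precision_score and recall_score (assuming predictions and true_labels from the cells above):

from sklearn.metrics import precision_score, recall_score
print(precision_score(true_labels, predictions))  # 6 / (6 + 3) = 0.666...
print(recall_score(true_labels, predictions))     # 6 / (6 + 2) = 0.75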

Making Sense of Metrics 1

Udacity: 'My true positive rate is high, which means that when a ___ is present in the test data, I am good at flagging him or her.'

Choices: POI, non-POI

Ans: POI. A True Positive means both the prediction and the reality are the Subject (here, POI); a high true positive rate means we are good at flagging POIs when they show up.

Making Sense of Metrics 2

(image: making%20sense.png)

Let us recall the definitions of Recall and Precision and rewrite them so they relate more directly to the question at hand.


$$ \displaystyle \begin{aligned}\text{Precision} &= \frac{\text{True Positive}}{\text{True Positive + False Positive}} = \frac{\text{TP}}{\text{TP+FP}}\\ &= \frac{(\text{Prediction: Subject, Reality: Subject})}{(\text{Prediction: Subject, Reality: Subject}) + (\text{Prediction: Subject, Reality: Others})}\\ &= \frac{(\text{Prediction: Subject, Reality: Subject})}{\text{Prediction: Subject (irrespective of Reality)}}\end{aligned}$$


$$ \displaystyle \begin{aligned}\text{Recall} &= \frac{\text{True Positive}}{\text{True Positive + False Negative}} = \frac{\text{TP}}{\text{TP+FN}}\\ &= \frac{(\text{Prediction: Subject, Reality: Subject})}{(\text{Prediction: Subject, Reality: Subject}) + (\text{Prediction: Others, Reality: Subject})}\\ &= \frac{(\text{Prediction: Subject, Reality: Subject})}{\text{Reality: Subject (irrespective of Prediction)}}\end{aligned}$$

Let us look at how the error counts influence these metrics, so that we can also reason in the reverse direction, from a metric value back to the kind of errors behind it.

Take a look at both in short form again.

$$ \displaystyle \text{Precision} = \frac{\text{TP}}{\text{TP+FP}} \qquad \text{Recall} = \frac{\text{TP}}{\text{TP+FN}}$$

Note that the <b>TP</b> value, whether it increases or decreases, has the <b>same effect</b> on both Precision and Recall. If TP increases, both precision and recall increase (provided FP and FN do not vary). So TP alone cannot explain one metric being high while the other is low.

Thus, let us consider the other two influencers in the equations: FP and FN.

One can easily observe that increasing FP decreases Precision. So the higher the False Positives, the lower the Precision. In other words, the more often we have 'prediction: subject, reality: others', the lower the precision. For the case at hand that is 'prediction: POI, reality: non-POI', so low precision means more non-POIs are wrongly flagged as POIs, which is naturally undesirable, as innocent non-POIs would be scrutinized as POIs.

So generally we prefer high Precision (so fewer non-Subjects are scrutinized unnecessarily).

Similarly, increasing FN decreases Recall. So the higher the False Negatives, the lower the Recall. In other words, the more often we have 'prediction: others, reality: subject', the lower the recall. For the case at hand that is 'prediction: non-POI, reality: POI', so low recall means more POIs are wrongly flagged as non-POIs, which is naturally undesirable, as POIs could escape.

So generally we also prefer high Recall (so fewer Subjects can 'escape').
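Since we usually want both high Precision and high Recall, the F1 score mentioned in the quizzes below combines the two into a single number, their harmonic mean, which is high only when both are high:

$$ \displaystyle \text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$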

Now, coming to the question, the statement reads as follows.


My identifier does not have a great X but it does have good Y.

This indicates that, out of the three candidates P (Precision), R (Recall), and F1, two are being compared: one is low, the other is high.



That means, nearly everytime a POI shows up in test set, I am able to identify him or her

This could mean high TP: nearly every time reality is POI, the prediction is POI. On its own that could point to high P, high R, or high F1.
More specifically, it means low FN: predicting someone as non-POI while in reality he or she is a POI happens rarely. Low FN means high Recall (fewer instances of Prediction: non-POI, Reality: POI).



The cost of this is that, sometimes I get false positives, where non-POIs get flagged

That is, predicting someone as POI while in reality he or she is a non-POI has increased, as the cost of tuning the algorithm for high recall.
In other words, high FP, or low Precision (more instances of Prediction: POI, Reality: non-POI).

The author has tuned the algorithm to reduce instances of POIs getting flagged as non-POIs (minimizing POIs escaping), at the cost of more non-POIs getting flagged as POIs. So X is Precision (not great) and Y is Recall (good).


Making Sense of Metrics 3

'My identifier doesn't have great _, but it does have good __. That means that whenever a POI gets flagged in my test set, I know with a lot of confidence that it's very likely to be a real POI and not a false alarm. On the other hand, the price I pay for this is that I sometimes miss real POIs, since I'm effectively reluctant to pull the trigger on edge cases.'


My identifier does not have a great X but it does have good Y.

As before, this indicates that out of the three candidates P (Precision), R (Recall), and F1, two are being compared: one is low, the other is high.



That means that whenever a POI gets flagged in my test set, I know with a lot of confidence that it's very likely to be a real POI and not a false alarm

Here, when the prediction is POI, reality is mostly POI and rarely non-POI.
In other words: more of (prediction: POI, reality: POI) and less of (prediction: POI, reality: non-POI). Since the prediction is the Subject in both cases, this points directly to Precision. Fewer instances of (prediction: POI, reality: non-POI) means fewer False Positives, that is, low FP or high Precision.



On the other hand, the price I pay for this is that I sometimes miss real POIs, since I'm effectively reluctant to pull the trigger on edge cases

Missing real POIs means prediction: non-POI while reality: POI, which is a False Negative. It is described as the price paid, implying that False Negatives increased as a result of the tuning. More False Negatives means lower Recall. So X is Recall (not great) and Y is Precision (good).

Making Sense of Metrics 4

'My identifier has a really great _.

This is the best of both worlds. Both my false positive and false negative rates are _, which means that I can identify POI's reliably and accurately. If my identifier finds a POI then the person is almost certainly a POI, and if the identifier does not flag someone, then they are almost certainly not a POI.'


My identifier has a really great _. This is the best of both worlds.

Remember that F1 combines the underlying P and R (their harmonic mean). An F1 of 1, the maximum, means both precision and recall are at their best, i.e. 1. So the author here could be referring to the F1 score.



Both my false positive and false negative rates are _, which means that I can identify POIs reliably and accurately.

Assuming the F1 is high because both P and R are high: they can be high only if their 'False' counterparts are low. That is, high P comes from low FP, and high R from low FN. So this blank could be filled with 'low', but let us go on.



If my identifier finds a POI then the person is almost certainly a POI..

If the prediction is POI, reality is almost always POI: a high True Positive count. This also suggests high P and high R, and thus low FN and FP.



and if the identifier does not flag someone, then they are almost certainly not a POI.

If the prediction is others, reality is others: a high True Negative count. This does not add much by itself. Let us go with 'F1 score' and 'low'.

Ans: F1 Score/Low is correct :)

Metrics for Your POI Identifier

Udacity: There's usually a tradeoff between precision and recall--which one do you think is more important in your POI identifier? There's no right or wrong answer, there are good arguments either way, but you should be able to interpret both metrics and articulate which one you find most important and why.

Ans: It depends.

In the case of POIs, I would reduce POIs escaping as non-POIs (prediction: non-POI, reality: POI; in other words, reduce False Negatives, i.e. increase Recall), even at the increased cost of non-POIs getting scrutinized as POIs (prediction: POI, reality: non-POI; in other words, increased False Positives, i.e. decreased Precision).

In short, for POIs I would increase Recall at the cost of decreased Precision. One hedged way to push the identifier in that direction is sketched below.
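This is only a sketch, not part of the Udacity solution: re-weighting the rare POI class via the standard class_weight parameter of DecisionTreeClassifier makes the tree more willing to flag POIs. It assumes features_train, features_test, labels_train, labels_test from the cells above; whether it actually helps depends on the data, and the weight of 10 is arbitrary.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

# give the rare POI class (label 1) a larger weight than non-POIs (label 0)
clf_recall = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=42)
clf_recall.fit(features_train, labels_train)
labels_pred2 = clf_recall.predict(features_test)

print(recall_score(labels_test, labels_pred2))     # hopefully higher recall...
print(precision_score(labels_test, labels_pred2))  # ...likely at the cost of precision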